19 research outputs found

    Few-shot learning for fine-grained emotion recognition using physiological signals

    Fine-grained emotion recognition can model the temporal dynamics of emotions. It is temporally more precise than predicting a single emotion for an entire activity (e.g., watching a video clip). Previous works require large amounts of continuously annotated data to train an accurate recognition model. However, the experiments needed to collect large amounts of continuously annotated physiological signals are costly and time-consuming. To overcome this challenge, we propose a few-shot learning algorithm, EmoDSN, which rapidly converges on a small amount of training data (typically fewer than 10 samples per class, i.e., <10-shot) for fine-grained emotion recognition. EmoDSN recognizes fine-grained valence and arousal (V-A) labels by maximizing the distance metric between signal segments with different V-A labels. We tested EmoDSN on three datasets (CASE, MERCA and CEAP-360VR) collected in three different environments: desktop, mobile and HMD-based virtual reality, respectively. The results from our experiments show that EmoDSN achieves promising results for both one-dimensional binary (high/low V-A, 1D-2C) and two-dimensional 5-class (four quadrants of the V-A space + neutral, 2D-5C) classification. We obtain average accuracies of 76.04%, 76.62% and 57.62% for 1D-2C valence, 1D-2C arousal and 2D-5C, respectively, using only 5 shots of training data. We also find that EmoDSN achieves better recognition results with fewer annotated samples if we select training samples from the changing points of emotion and the ending moments of video watching.
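
    The abstract does not give EmoDSN's exact architecture, but the core idea it states, learning a metric that separates segments with different V-A labels from only a handful of examples, can be sketched as a siamese encoder trained with a contrastive loss. Everything below (the 1-D CNN encoder, layer sizes, and margin) is an illustrative assumption, not the paper's design.

```python
# Hypothetical sketch of the metric-learning idea behind EmoDSN: a siamese
# embedding network trained so that physiological-signal segments with
# different valence/arousal labels are pushed apart in embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    """Embed a signal segment (channels x time) into a unit vector."""
    def __init__(self, in_channels: int = 2, embed_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),
        )
        self.fc = nn.Linear(32 * 8, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x).flatten(1)
        return F.normalize(self.fc(h), dim=1)

def contrastive_loss(z1, z2, same_label, margin: float = 1.0):
    """Pull same-label pairs together; push different-label pairs apart."""
    d = F.pairwise_distance(z1, z2)
    return torch.mean(same_label * d.pow(2)
                      + (1 - same_label) * F.relu(margin - d).pow(2))
```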

    User centered adaptive streaming of dynamic point clouds with low complexity tiling

    In recent years, the development of devices for the acquisition and rendering of 3D content has facilitated the diffusion of immersive virtual reality experiences. In particular, the point cloud representation has emerged as a popular format for volumetric photorealistic reconstructions of dynamic real-world objects, due to its simplicity and versatility. To optimize the delivery of the large amount of data needed to provide these experiences, adaptive streaming over HTTP is a promising solution. To ensure the best quality of experience within the bandwidth constraints, adaptive streaming is combined with tiling to optimize the quality of what the user is visualizing at a given moment; as such, it has been successfully used in the past for omnidirectional content. However, its adoption in the point cloud streaming scenario has so far only been studied to optimize multi-object delivery. In this paper, we present a low-complexity tiling approach to perform adaptive streaming of point cloud content. Tiles are defined by segmenting each point cloud object into several parts, which are then independently encoded. To evaluate the approach, we first collect real navigation paths through a 6-degrees-of-freedom user study with 26 participants. The variation in movements and interaction behaviour among users indicates that user-centered adaptive delivery could lead to appreciable gains in perceived quality. Evaluating the proposed tiling approach against state-of-the-art solutions for point cloud compression on the collected navigation paths confirms that considerable gains can be obtained with user-adaptive streaming, achieving bitrate gains of up to 57% with respect to a non-adaptive approach with the same codec. Moreover, we demonstrate that the selection of navigation data has an impact on the relative objective scores.
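
    One plausible shape of the user-centered adaptation described here, sketched under assumptions: each independently encoded tile carries a visibility weight derived from the user's viewport, and the bit budget is spent where the visibility-weighted quality gain per bit is highest. The visibility weights, bitrate ladder, and greedy marginal-utility rule are illustrative assumptions; the paper's actual rate-allocation logic is not specified in the abstract.

```python
# Greedy, visibility-driven quality selection across point cloud tiles.
from dataclasses import dataclass

@dataclass
class Tile:
    visibility: float   # fraction of the tile inside the user's viewport
    bitrates: list      # available representations, ascending (kbit/s)
    level: int = 0      # currently selected quality level

def select_tile_qualities(tiles, budget_kbps):
    """Upgrade one tile at a time, always taking the best gain-per-bit."""
    spent = sum(t.bitrates[0] for t in tiles)   # start at the lowest level
    while True:
        best, best_gain = None, 0.0
        for t in tiles:
            if t.level + 1 >= len(t.bitrates):
                continue
            extra = t.bitrates[t.level + 1] - t.bitrates[t.level]
            if extra <= 0 or spent + extra > budget_kbps:
                continue
            gain = t.visibility / extra         # quality gain per extra bit
            if gain > best_gain:
                best, best_gain = t, gain
        if best is None:                        # budget exhausted
            return tiles
        spent += best.bitrates[best.level + 1] - best.bitrates[best.level]
        best.level += 1
```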

    Weakly-supervised learning for fine-grained emotion recognition using physiological signals

    Instead of predicting just one emotion for one activity (e.g., video watching), fine-grained emotion recognition enables temporally more precise recognition. Previous works on fine-grained emotion recognition require segment-by-segment, fine-grained emotion labels to train the recognition algorithm. However, experiments to collect these labels are costly and time-consuming compared with collecting only one emotion label after the user has watched the stimulus (i.e., the post-stimuli emotion label). To recognize emotions at a finer granularity when trained with only post-stimuli labels, we propose an emotion recognition algorithm based on Deep Multiple Instance Learning (EDMIL) using physiological signals. EDMIL recognizes fine-grained valence and arousal (V-A) labels by identifying which instances represent the post-stimuli V-A annotated by users after watching the videos. Instead of fully-supervised training, the instances are weakly supervised by the post-stimuli labels in the training stage. The V-A labels of instances are estimated from the instance gains, which indicate the probability of each instance predicting the post-stimuli labels. We tested EDMIL on three datasets collected in three different environments: desktop, mobile and HMD-based virtual reality, respectively. Recognition results validated against fine-grained V-A self-reports show that, for subject-independent low/neutral/high V-A classification, EDMIL outperforms state-of-the-art methods. Our experiments find that weakly-supervised learning can reduce the overfitting caused by the temporal mismatch between fine-grained annotations and physiological signals.
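
    A minimal sketch of the weak-supervision idea the abstract describes: instances (short signal segments) are scored individually, the per-instance scores ("instance gains") are pooled into one bag-level prediction trained against the single post-stimuli label, and after training the instance gains serve as fine-grained V-A estimates. The linear scorer and mean pooling are assumptions for illustration, not the paper's exact design.

```python
# Multiple-instance head: bag-level training, instance-level inference.
import torch
import torch.nn as nn

class MILHead(nn.Module):
    def __init__(self, feat_dim: int = 64, n_classes: int = 3):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, n_classes)

    def forward(self, bag: torch.Tensor):
        """bag: (n_instances, feat_dim) features of one video's segments."""
        gains = self.scorer(bag).softmax(dim=1)  # per-instance class probs
        bag_pred = gains.mean(dim=0)             # pooled bag prediction
        return bag_pred, gains

# Training minimizes -torch.log(bag_pred[label]) for the single
# post-stimuli label; at test time, argmax over `gains` yields
# segment-by-segment V-A estimates.
```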

    CorrNet: Fine-grained emotion recognition for video watching using wearable physiological sensors

    Recognizing user emotions while they watch short-form videos anytime and anywhere is essential for facilitating video content customization and personalization. However, most works either classify a single emotion per video stimulus, or are restricted to static, desktop environments. To address this, we propose a correlation-based emotion recognition algorithm (CorrNet) to recognize the valence and arousal (V-A) of each instance (a fine-grained segment of signals) using only wearable physiological signals (e.g., electrodermal activity, heart rate). CorrNet takes advantage of features both inside each instance (intra-modality features) and between different instances for the same video stimulus (correlation-based features). We first test our approach on an indoor-desktop affect dataset (CASE), and thereafter on an outdoor-mobile affect dataset (MERCA), which we collected using a smart wristband and a wearable eye tracker. Results show that for subject-independent binary classification (high/low), CorrNet yields promising recognition accuracies: 76.37% and 74.03% for V-A on CASE, and 70.29% and 68.15% for V-A on MERCA. Our findings show that: (1) instance segment lengths between 1–4 s result in the highest recognition accuracies; (2) accuracies of laboratory-grade and wearable sensors are comparable, even at low sampling rates (≀64 Hz); and (3) large amounts of neutral V-A labels, an artifact of continuous affect annotation, result in varied recognition performance.
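
    The abstract names CorrNet's two feature families without defining them; a hedged sketch is below. Simple summary statistics stand in for the learned intra-modality features, and correlation-based features are taken as each instance's correlation profile against the other instances of the same stimulus. Both choices are assumptions for illustration.

```python
# Intra-modality features per instance, plus cross-instance correlations
# computed over all instances belonging to the same video stimulus.
import numpy as np

def intra_features(instance: np.ndarray) -> np.ndarray:
    """instance: (channels, samples) segment of wearable signals."""
    return np.concatenate([instance.mean(axis=1), instance.std(axis=1)])

def correlation_features(instances: list) -> np.ndarray:
    """Correlate each instance's feature vector with every other instance
    of the same stimulus; row i is instance i's correlation profile."""
    feats = np.stack([intra_features(x) for x in instances])
    return np.corrcoef(feats)   # (n_instances, n_instances)

# Each instance would then be classified from the concatenation of its
# intra_features vector and its row of correlation_features.
```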

    Subjective QoE Evaluation of User-Centered Adaptive Streaming of Dynamic Point Clouds

    Technological advances in head-mounted displays and novel real-time 3D acquisition and reconstruction solutions have fostered the development of 6-Degrees-of-Freedom (6DoF) teleimmersive systems for social VR applications. Point clouds have emerged as a popular format for such applications, owing to their simplicity and versatility; yet, dense point cloud contents are too large to deliver directly over bandwidth-limited networks. In this context, user-adaptive delivery mechanisms are a promising solution for exploiting the increased range of motion offered by 6DoF VR applications to yield gains in the perceived quality of 3D point cloud user representations, while reducing their bandwidth requirements. In this paper, we perform a user study in VR to quantify the gains that adaptive tile selection strategies can bring with respect to non-adaptive solutions. In particular, we define an auxiliary utility function, employ established methods from the literature and newly-proposed schemes for distributing the bit budget across the tiles, and evaluate them together with non-adaptive streaming baselines through subjective QoE assessment. Results confirm that considerable gains can be obtained with user-adaptive streaming, achieving bit rate gains of up to 65% with respect to a non-adaptive approach delivering comparable quality. Our analysis provides useful insights for the design and development of social VR applications.
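
    The abstract mentions an auxiliary utility function and bit-budget distribution schemes without defining them. The sketch below shows one plausible shape such a scheme could take: a utility combining tile visibility and viewing distance, followed by a proportional budget split. The utility form, weights, and proportional rule are assumptions, not the paper's definitions.

```python
# Utility-proportional bit-budget split across point cloud tiles.
def tile_utility(visible_fraction: float, distance_m: float,
                 w_vis: float = 1.0, w_dist: float = 0.5) -> float:
    """Higher utility for tiles that are more visible and closer."""
    return w_vis * visible_fraction + w_dist / max(distance_m, 0.1)

def split_budget(tiles, budget_kbps):
    """tiles: list of (visible_fraction, distance_m, ladder) tuples, where
    ladder is an ascending list of available bitrates for that tile."""
    utils = [tile_utility(v, d) for v, d, _ in tiles]
    total = sum(utils) or 1.0
    chosen = []
    for (_v, _d, ladder), u in zip(tiles, utils):
        share = budget_kbps * u / total
        # pick the highest representation that fits this tile's share
        rate = max([r for r in ladder if r <= share], default=ladder[0])
        chosen.append(rate)
    return chosen
```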

    Advances in Information Retrieval

    Personal lifelog archives contain digital records captured from an individual's daily life, e.g., emails, web pages downloaded, and SMS messages sent or received. While capturing this information is becoming increasingly easy, subsequently locating relevant items in response to user queries within these archives is a significant challenge. This paper presents a novel query-independent static biometric scoring approach for re-ranking result lists retrieved from a lifelog with a BM25 model over content and content + context data. For this study, we explored the utility of the galvanic skin response (GSR) and skin temperature (ST) associated with a past experience of an item as a measure of the item's potential future significance. The results obtained indicate that our static scoring techniques are useful for re-ranking retrieved result lists.
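
    A minimal sketch of the re-ranking idea the abstract states: a query-dependent BM25 content score is combined with a query-independent static biometric score derived from the GSR and ST recorded when the item was originally experienced. The linear combination, the weight alpha, and the min-max normalization are illustrative assumptions.

```python
# Combine BM25 relevance with a static biometric significance score.
def minmax(values):
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def rerank(results, alpha: float = 0.7):
    """results: list of dicts with 'item', 'bm25', 'gsr', 'st' fields.
    Returns items re-ranked by combined content + biometric score."""
    bm25 = minmax([r["bm25"] for r in results])
    bio = minmax([r["gsr"] + r["st"] for r in results])  # static score
    scored = [(alpha * b + (1 - alpha) * s, r["item"])
              for b, s, r in zip(bm25, bio, results)]
    return [item for _, item in sorted(scored, key=lambda t: t[0],
                                       reverse=True)]
```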

    Towards a comprehensive model for predicting the quality of individual visual experience

    Recently, a lot of effort has been devoted to estimating the Quality of Visual Experience (QoVE) in order to optimize video delivery to the user. For many decades, existing objective metrics mainly focused on estimating the perceived quality of a video, i.e., the extent to which artifacts due to, e.g., compression disrupt the appearance of the video. Other aspects of the visual experience, such as enjoyment of the video content, were, however, neglected. In addition, Mean Opinion Scores were typically targeted, deeming the prediction of individual quality preferences too hard a problem. In this paper, we propose a paradigm shift and evaluate the opportunity of predicting individual QoVE preferences, in terms of both video enjoyment and perceived quality. To do so, we explore the potential of features of different natures to be predictive of a user's specific experience with a video. We thus consider not only features related to the perceptual characteristics of a video, but also to its affective content. Furthermore, we integrate into our framework information about the user and the use context. The results show that effective feature combinations can be identified to estimate the QoVE from the perspective of both enjoyment and perceived quality.
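
    A hedged sketch of the feature-fusion setup the abstract describes: perceptual (artifact-related), affective (content-related), and user/context features are concatenated and fed to a classifier predicting an individual viewer's enjoyment or perceived quality. The random-forest choice and the example feature names are assumptions, not the paper's model.

```python
# Fuse perceptual, affective, user, and context features per (user, video).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_feature_vector(perceptual, affective, user, context):
    """Each argument is a dict of precomputed features for one (user, video)
    pair, e.g. perceptual={'blockiness': 0.2}, affective={'valence': 0.6},
    user={'age': 31}, context={'screen_diag_in': 15}. Key order must be
    kept consistent across samples."""
    groups = (perceptual, affective, user, context)
    return np.concatenate([np.fromiter(g.values(), dtype=float)
                           for g in groups])

# X: one row per (user, video) pair; y: that user's enjoyment rating bin.
model = RandomForestClassifier(n_estimators=200, random_state=0)
# model.fit(X, y); model.predict(X_new)
```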